{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from pandas import Series, DataFrame\n",
    "\n",
    "s = pd.read_csv('../data/nyc-temps.txt').squeeze()\n",
    "df = DataFrame({'temp': s, \n",
    "                'hour': [0,3,6,9,12,15,18,21] * 91})\n",
    "\n",
    "df.loc[(df['hour'] == 3) | (df['hour'] == 6), 'temp'] = np.nan"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Beyond 1\n",
    "\n",
    "By default, the `interpolate` method tries to average the remaining values before and after any `NaN`. However, we can change how it works, by passing `method='nearest'`. Does that change our data substantially?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>temp</th>\n",
       "      <th>hour</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>728.000000</td>\n",
       "      <td>728.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>-1.050824</td>\n",
       "      <td>10.500000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>5.026357</td>\n",
       "      <td>6.878589</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>-14.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>-4.000000</td>\n",
       "      <td>5.250000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>10.500000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>2.000000</td>\n",
       "      <td>15.750000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>12.000000</td>\n",
       "      <td>21.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             temp        hour\n",
       "count  728.000000  728.000000\n",
       "mean    -1.050824   10.500000\n",
       "std      5.026357    6.878589\n",
       "min    -14.000000    0.000000\n",
       "25%     -4.000000    5.250000\n",
       "50%      0.000000   10.500000\n",
       "75%      2.000000   15.750000\n",
       "max     12.000000   21.000000"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# No, doesn't seem to change things significantly -- maybe because temperatures\n",
    "# don't really vary all that much across readings.\n",
    "\n",
    "df.interpolate(method='nearest').describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Beyond 2\n",
    "\n",
    "Let's assume that the equipment works fine around the clock, but that it fails to record a reading at -1 degree and below. Are the interpolated values similar to the real (missing) values they replace? Why or why not?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# reset our data\n",
    "df = DataFrame({'temp': s, \n",
    "                'hour': [0,3,6,9,12,15,18,21] * 91})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Remove values <= -1 to NaN\n",
    "df.loc[df['temp'] <= -1, 'temp'] = np.nan"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Interpolate!\n",
    "df = df.interpolate()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count    721.000000\n",
       "mean       2.022191\n",
       "std        2.345483\n",
       "min        0.000000\n",
       "25%        0.209524\n",
       "50%        1.000000\n",
       "75%        3.000000\n",
       "max       12.000000\n",
       "Name: temp, dtype: float64"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Wow, the mean is now 2 and the median is now 1 -- significantly higher\n",
    "# Not surprising, of course, given that we removed all very low temperatures!\n",
    "df['temp'].describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Beyond 3\n",
    "\n",
    "A cheap solution to interpolation is to replace `NaN` values with the column's mean. Do this, and compare the new mean and median. Again, why are (or aren't) these values similar to the original ones?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# reset our data\n",
    "df = DataFrame({'temp': s, \n",
    "                'hour': [0,3,6,9,12,15,18,21] * 91})\n",
    "\n",
    "# Remove values <= -1 to NaN\n",
    "df.loc[df['temp'] <= -1, 'temp'] = np.nan\n",
    "\n",
    "df = df.fillna(df.mean())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>temp</th>\n",
       "      <th>hour</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>728.000000</td>\n",
       "      <td>728.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>2.763926</td>\n",
       "      <td>10.500000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>1.935689</td>\n",
       "      <td>6.878589</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>2.000000</td>\n",
       "      <td>5.250000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>2.763926</td>\n",
       "      <td>10.500000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>2.763926</td>\n",
       "      <td>15.750000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>12.000000</td>\n",
       "      <td>21.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             temp        hour\n",
       "count  728.000000  728.000000\n",
       "mean     2.763926   10.500000\n",
       "std      1.935689    6.878589\n",
       "min      0.000000    0.000000\n",
       "25%      2.000000    5.250000\n",
       "50%      2.763926   10.500000\n",
       "75%      2.763926   15.750000\n",
       "max     12.000000   21.000000"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Wow, these values are even worse than the interpolated ones!\n",
    "\n",
    "# Clearly, running .interpolate is a better option than using the mean -- \n",
    "# in no small part because it calculated a local mean, rather than\n",
    "# a global one across all of the data.\n",
    "\n",
    "df.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
